Analysis of affordability and safety of housing close to LSE

Contents¶

    1. Introduction
    • 1.1. Motivation
    • 1.2. Research Questions
    • 1.3. Originality
    • 1.4. Overview of Method
    1. Data Acquisition
    • 2.1. OpenRent General Webpage Scraping
    • 2.2. OpenRent Individual Listing Webpage Scraping
    • 2.3. Adding Crime Information from Police.uk API
    • 2.4. Scraping Data for LSE Halls
    1. Prepare the Data for Analysis
    • 3.1. Adding Private Property Data to a DataFrame
    • 3.2. Adding LSE Accommodations to DataFrame
    1. Data Analysis
    • 4.1. Summary Statistics and Final Data Corrections
    • 4.2. General Correlation
    • 4.3. Effects of no. bedrooms on cost per bedroom
    • 4.4. Number of affordable properties and safety by postcode
    • 4.5. Effects of distance from LSE on price
      • 4.5.1. Regression analysis
      • 4.5.2. Regression results
      • 4.5.3. Further regression evaluation
      • 4.5.4. Regression excluding crime numbers and properties over £2000
    • 4.6. Comparison to LSE accommodation
      • 4.6.1. Regression Analysis and Results
      • 4.6.2. Visualisation of regressed dataset
      • 4.6.3. Visualisation of the whole dataset
    1. Conclusion
    • 5.1. Summary of Results
    • 5.2. Interpretation of Results
    • 5.3. Limitations
    • 5.4. Potential Further Analysis
    1. References

1. Introduction¶

1.1. Motivation¶

Rental prices in London have been increasing for years, with average monthly rent in the city rising 11.8% in 2023, to £2,425. With LSE barely having sufficient space in student halls for just its first year of undergraduates, the majority of students are forced to move into either private student halls or their own privately rented flat or house for their subsequent years of study. With so many students having to make this switch each year, we set out to gain some data to help in the search for reasonable, affordable, and safe accommodation.

1.2. Research Questions¶

The goal of the study is to provide knowledge on how various factors affect the costs of property around LSE, and tactics to find more affordable accommodation. The main questions to be answered are:

  • How does distance from LSE affect the cost of flats per bedroom?
  • Are there postcode areas that are especially affordable considering their proximity, and is this influenced by crime statistics in the area?
  • Flats with which number of bedrooms have the lowest cost per bedroom?
  • Is moving out of LSE halls cheaper or more expensive than living in them?

1.3. Originality¶

Whilst there are some resources from LSE to help students find private accommodation to live in, these mostly direct students to housing events and property websites from which to search for properties, without any specific information on what sorts of property to seek out, and in which areas. Our approach is much more data driven, with data from hundreds of properties around LSE, and so aims to help inform students with specific strategies to find the most affordable accommodation.

1.4. Overview of Method¶

The method we will use is to scrape data on available rental properties within a 4km radius of LSE from OpenRent, one of London’s leading online property search websites. Each property will have a number of crimes within a 1 mile radius for the last available month found through the use of the police.uk API. Similarly, we will scrape data on LSE halls from the LSE accommodation website, and use the police.uk API to find crime statistics for these too. The two sets of data will be combined into a dataframe, from which we can create summary statistics, visual elements such as graphs, and perform regression analysis.

2. Data Acquisition¶

Reference the following ranking to identify the most reliable websites which are representative of the London housing market. The first 2 - Zoopla and RightMove - have rules against web scraping. As a result we proceed to scrape data from the 3rd place holder in the ranking - OpenRent. For our analysis we will need data for all properties about their price, number of bedrooms, potentially bathrooms, maximum number of tenants, distance from LSE, listing/property type, postcode, and number of crimes around it. For the last part we will utilize police.uk's api, which requires coordinates as input.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re # Import library for string manipulation
import string
import time # Import libraries in case we need to make the requests seem less robotic
import random
# Use Selenium to add options to the browser, for conditional waiting, scrolling and more.
# Most importantly, we use it because the website has dynamic content for which JavaScript is utilised.
# With Selenium we can read that content and work with it alongside the normal html code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import threading
from threading import Thread
import queue
import copy
import json

2.1. OpenRent General Webpage Scraping¶

  • Something we notice while browsing through the OpenRent website is that not all the content loads in the beginning, i.e. it is supposed to show all properties, but it shows just a few of them and the option to browse through different pages of properties is broken. If we scroll all the way to the bottom, however, more properties load and if we continue doing that, eventually we reach the bottom of the page and all properties are displayed. This makes it a bit easier for us as all the data is on only one page. We just have to use *Selenium to scroll* for us.
  • For the initial link that we use we will define and insert the parameters, so that the code in this project can be implemented for any school or any other insitution (even workplace) that is not LSE by simply changing the postcode and the name of the area. One can also filter the properties based on distance from that school/institution, min/max price, min/max bedrooms in the flat, and more advanced filters.
    • For our analysis the default options we choose are properties within 4 km from LSE (postcode WC2A 2AE), have maximum 6 bedrooms, and accept students. We also set min price to be £100 per month so that we filter out unserious listings or properties without a price.
    • Note that the code below takes approximately 1 minute to execute.
In [2]:
# Choose your preferred inputs below. These are for LSE:
postcode = "WC2A 2AE"
area = "Westminster"
county = "Greater London"
radius = "4"
prices_min = "100"
prices_max = "" # We do not add limits on the price as we want to analyse the price. If users want, add " &prices_max=x" to the url.
bed_max = "6"
accept_students = "true"

# Manipulate to get desired formats
outcode = postcode.split()[0]; incode = postcode.split()[-1] # Split postcode into outcode and incode
full_area = re.sub(r'\s+', "-", f'{area} {county}') # Join area and county and substitute spaces with "-"
area_url = re.sub(r'\s+', "%20", area); county_url = re.sub(r'\s+', "%20", county) # Substitute spaces with '%20' in area and county.

# Set up Chrome options to make running the url headless, i.e. run the commands without actually opening a browser window.
# This way we speed the process as there's no need for all the visual representation:
chrome_ops = Options()
chrome_ops.add_argument("--headless")

# Choose chrome as our browser and add the headless option:
browser = webdriver.Chrome(options=chrome_ops)

# Define the url with the parameters above and get it:
url_1 = f'https://www.openrent.co.uk/properties-to-rent/{outcode.lower()}-{incode.lower()}-{full_area.lower()}?term={outcode}%20{incode}%20{area_url},%20{county_url}&area={radius}&prices_min={prices_min}&bedrooms_max={bed_max}&acceptStudents={accept_students}'
browser.get(url_1) # Open the url in the browser

# Wait for the whole body to become visible:
wait = WebDriverWait(browser, 30) # Set a maximum waiting time of 30 seconds
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "body")))

# Scroll until the height 1 second after we scroll is the same as the height before the scroll.
# This shows us when there are no more properties to load.
# Make the loop infinite as the variables of the condition have to be defined inside of the loop and then "break" it:
while "We are" != "ST115 superstars":
    old_h = browser.execute_script("return document.body.scrollHeight") # Find the height before the scroll

    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);") # Scroll to the bottom
    time.sleep(1)
    
    new_h = browser.execute_script("return document.body.scrollHeight") # Find the height after the scroll and the load of 1 sec
    if new_h == old_h:
        break

# Get the page source after we have finished scrolling and it has fully loaded:
html_code = browser.page_source

# Close the driver to avoid it running in the background:
browser.quit()

# Make the regular soup out of the html code:
main_soup = BeautifulSoup(html_code, 'lxml')

# Save the code to a file:
with open("main_property_page_code.html", 'w', encoding='utf-8') as file:
    file.write(html_code)
#print(main_soup)

Let's iterate through the properties and get the data we need directly from the preview window. We will also open the links for every individual property, but some of the data, such as the distance from the specified postal code, cannot be found in the individual pages. This is because they are generic property pages, while our main page is filtered the way we want it and, therefore, shows distance from input address.

  • Note that the <a> with class "pli clearfix" are available to rent properties. Those which are no longer available, i.e. those titled "Let Agreed", have <a> with class "pli l-a clearfix". Therefore, with the first line below, we only chose the currently available properties.
  • Also to get the individual links foreach property listing, we note that they are in the format https://www.openrent.co.uk/xxxxxxx where xxxxxxx is the property id, which is unique.

We can also check whether the number of properties in the html code we got is the same as the number shown at the top of the webpage. We do that by adding up all available and all "Let Agreed" properties:

In [3]:
# Find the number of all properties we analyse + the number of properties that are no longer available:
all_properties = len(main_soup.find_all("a", class_ = "pli clearfix")) + len(main_soup.find_all("a", class_ = "pli l-a clearfix"))
# Get the number of total properties stated at the top of the page:
number_of_properties_on_page = int(main_soup.find("div", class_="contentPane__content transparent").find("div", id='top-detail-bar').find("div", class_='search-detail').find("span", class_='filter-info').find("span").get_text().strip())

print(f"Our approach is correct - \033[1m{all_properties==number_of_properties_on_page}\033[0m - and the total number of properties is \033[1m{all_properties}\033[0m.")
Our approach is correct - True - and the total number of properties is 787.

Let's proceed with the scraping below:

  • Note that the code below has been executed successfully on 24/04/2024 around 11pm. As a result, the data is from that timeframe.
In [4]:
# Find all sections for different properties on the main page:
all_available_props = main_soup.find_all("a", class_ = "pli clearfix")

def get_distance(prop):
    dist = prop.find("div", class_ = "price-location clearfix").find("div", class_="ltc pl-title").find("h2").get_text().strip()
    return dist
    
def get_price(prop):
    price = prop.find("div", class_="price-location clearfix").find("div", class_="pim pl-title").find("h2").get_text().strip()
    return price
    
def get_type_and_outcode(prop):
    title = prop.find("div", class_="location-description").find("span").get_text().strip()
    title_split = title.split(',') # Make a list with the components in the title
    prop_type = title_split[0].strip().lower() # Take the first component (property type) and make it lowercase
    outcode = title_split[-1].strip() # Take the outcode (first part of postcode)
    return [prop_type, outcode]
    
props = {} # Create an empty list to populate with the data
# Iterate through them and get the info we require:
for prop in all_available_props:
    prop_id = prop["href"].strip("/")
    prop_url = "https://www.openrent.co.uk/" + prop_id
    type_and_outcode = get_type_and_outcode(prop)
    # Put a list of the property url, price and distance in the dictionary, use the id for a key:
    props[prop_id] = [prop_url, get_price(prop), get_distance(prop), type_and_outcode[0], type_and_outcode[-1]]
    
#display only first 3 listings from the resulting dictionary:
display({k:props[k] for k in list(props)[:3]})
{'2016944': ['https://www.openrent.co.uk/2016944',
  '£7,800 per month',
  '0.17  km',
  '3 bed flat',
  'WC2A'],
 '2043123': ['https://www.openrent.co.uk/2043123',
  '£2,396 per month',
  '0.31  km',
  'studio flat',
  'WC2B'],
 '2016841': ['https://www.openrent.co.uk/2016841',
  '£5,633 per month',
  '0.35  km',
  '2 bed flat',
  'WC2R']}
  • As shown above, for now we have the property's url, price, distance from LSE, type, and outcode in this order. Now the remaining info we will get from the individual webpages, namely, bedrooms, max tenants, bathrooms, and coordinates of property:

2.2. OpenRent Individual Listing Webpage Scraping¶

  • The individual websites can be scraped with requests to get the number of bedrooms, bathrooms, and maximum tenants. However, the website has anti-spamming policies that temporarily ban the IP address when too many requests are made. If we put a time.sleep() we manage to get the data eventually.
  • Although that code would run within a 'reasonable' timeframe, it is limited in the fact that it cannot interact with the dynamic content of the page written in JavaScript. Why do we need that?
  • In order to get the coordinates which we will need for the API in 2.3., we will need to click on an image of the webpage. As can be seen on this individual property webpage for example, there is only a picture of the map of the area around the property and if we click on it, it turns into an interactive map while preserving the same URL. The way this is reflected in the “html” code is that before clicking it, we have <div id=”googleMapPartialContainer”>, which is empty, but after we click on it, this div gets populated with JavaScript code, in which a variable called latlng is defined with the coordinates of the property. We can do the clicking on the image and getting the coordinates with Selenium - similar to the scrolling we did in 2.1.
    • This would take substantially more time as the number of properties, and therefore, the number of URLs to open, can be almost 1000 (depending on the current availability and the input parameters). It would, however, make our analysis of the crime numbers more accurate as the police.uk API would give us crime numbers in 1 mile radius of these coordinates. This is more relevant than if we were to use the outcode of the property, get the coordinates of the center of the area corresponding to that outcode, and use that as the input for the API. This alternative approach may prove inappropriate as some properties may be far away from the center of the outcode they are in. As a result, using Selenium to get the coordinates may be worth the time!
    • See the number of websites we need to scrape below:
In [6]:
print(f'We are currently analysing \033[1m{len(props)}\033[0m properties.') # Note these are the still-available ones.
We are currently analysing 753 properties.
  • We can complete the scraping more quickly using multithreading. Multithreading will take a batch of URLs and open a tab in the browser, click on the map, wait for it to load, find the coordinates, and close the browser for all URLs within the batch at the same time, because there are X 'workers' performing the task at the same time. We create a queue with the URLs, from which the workers take URLs to work on, regardless of what is going on with the other workers. The queue is useful to keep the process structured - it is like the 'project manager' distributing the tasks to the workers.
    • Below we define the task described above. Note that the image that we click on is with id = "staticGoogleMap".
    • Regarding the number of workers - we choose that based on our needs. Theoretically, we can set 753 workers, which means that there is a worker for each URL. So in theory, our code would take only 12-13 seconds to run (the time it takes for the code defined in the task() function below to execute for one URL). In practice, however, this would take a lot of computer power and memory and more importantly it would cause our IP address to get temporarily banned by the website due to spam. Even if we choose 10 workers (therefore reducing the total time it takes to 753 * 12sec / 10 / 60 = 15 min, our IP is likely to get banned. As a result we use only 2 workers.
      • Note that because of the current number of workers, the code below takes approximately 70 minute to execute.
    • We also incorporate Exceptions into the code, so that if there are any issues, the code continues to run by skipping the URL, the scraping of which causes an exception. This is for debugging purposes. If the code is being re-run, no exceptions should be printed in the output unless the number of workers is increased or the website tightens its spamming policy.
  • The code below has been executed successfully on 24/04/2024 around 11pm. As a result, the data is from that timeframe.
    • Note that the number of workers can be changed to speed up the process, but we would not recommend running the code with more than 2 workers as the IP may get banned.
In [ ]:
%%prun

props_i = copy.deepcopy(props)

class TaskQueue(queue.Queue):

    # Inherit from the Queue.Queue class to get access to the get, put methods and the queue behaviour:
    def __init__(self, num_workers=1): # Set some default number of workers which we will change in tests()
        queue.Queue.__init__(self)
        self.num_workers = num_workers
        self.start_workers()

    # Store the task, args, kwargs as tuples in the queue.
    # Note that *args allows passing variable number of arguments and **kwargs allows named arguments
    def add_task(self, task, *args, **kwargs):
        args = args or ()
        kwargs = kwargs or {}
        self.put((task, args, kwargs))

    # Create threads for each of the workers and fire them off in background mode:
    def start_workers(self):
        for i in range(self.num_workers):
            t = Thread(target=self.worker)
            t.daemon = True
            t.start()

    # Add the actual worker code.
    # Note that the worker basically just gets the first task in the queue and runs it.
    def worker(self):
        while True:
            item, args, kwargs = self.get()
            item(*args, **kwargs)  
            self.task_done()

# To test the queue but also run the whole threading:
def tests():
    # Define the actual task:
    def task(props_i, *args, **kwargs):
        # Open the chrome
        chrome_ops = Options()
        chrome_ops.add_argument("--headless")
        browser = webdriver.Chrome(options=chrome_ops)
        url = str(args[0]) # Get URL
        #print("open br")
        try:
            browser.get(url)
            # Find the static google maps picture and click on it:
            browser.find_element(By.XPATH, "//img[@id='staticGoogleMap']").click()
            #print("click on element")

            # Wait until the interactive map appear and set max waiting time:
            WebDriverWait(browser, 3).until(EC.presence_of_element_located((By.XPATH, "//div[@id='googlemapandstreetview']")))
            code = browser.page_source # Get the html code
            soup = BeautifulSoup(code, 'lxml') # Turn it into a soup
            #print("element located successfully")
            
            # Get the overview table:
            table = soup.find("table", class_="table table-striped intro-stats")
            cells = table.find_all("td") # Get each cell with data
            #print("found table")
            
            # Make 2 lists - 1 with the type of data - bedroom/bathroom/max tenants and one with the values.
            # Note that we do not need the location cell, which is the last one out of 4.
            data_var = [cell.find("span").get_text(strip=True).strip(":") for cell in cells[:3]] # Strip spaces and ":"
            value = [cell.find("strong").get_text(strip = True) for cell in cells[:3]]
            bbt = dict(zip(data_var, value)) # Make a dictionary out of it

            # Find the coordinates by searching for a pattern in the code string. Note we use code and not soup!
            # Note that there is a line of code that defines a variable and inputs the coordinates always:
            # The line in the JavaScript code is "var latlng = new L.LatLng(latitude, longitude);"
            var_latlng = r'new L\.LatLng\(([^,]+),([^)]+)\);'
            match = re.search(var_latlng, code) # Find the pattern in the code string
            lat = float(match.group(1))
            lng = float(match.group(2))

            # Get the id from the url and use it to reference the dictionary and append the corresponding list with our bbt dict.
            prop_id = url[len("https://www.openrent.co.uk/"):]
            props_i[prop_id].append(bbt)
            props_i[prop_id].append([lat, lng])
            #print("dict appended")
        
        except Exception as e:
            # Handle the exception gracefully
            print("An exception occurred:", e)
            # Perform actions or logging specific to the exception

            # Increment a counter to keep track of the number of occurrences
            # You can define the counter variable outside the task function
            global exception_count
            exception_count += 1
        
        finally:
            browser.quit() # Close the browser
            #print("close br")
            
    # CHOOSE THE NUMBER OF WORKERS:
    q = TaskQueue(num_workers=2)

    for prop in list(props_i.values()): # Scrape the list from the dictionary for urls and keep adding tasks
        url = prop[0]
        q.add_task(task, props_i, url)

    q.join() # Block until all tasks are done
    # Save the dictioinary in a JSON file so that we have the data saved and we do not have to rerun every time:
    with open("data/private_property_info.json", "w") as json_file:
        json.dump(props_i, json_file)
    print("All done")
    display({k:props_i[k] for k in list(props_i)[:3]})

if __name__ == "__main__":
    global exception_count
    exception_count = 0
    tests() # Here we call the tests function which runs everything if the script is executed directly
    print(f"There has been an exception for {exception_count} of the properties")
All done
{'2016944': ['https://www.openrent.co.uk/2016944',
  '£7,800 per month',
  '0.17  km',
  '3 bed flat',
  'WC2A',
  {'Bedrooms': '3', 'Bathrooms': '2', 'Max Tenants': '5'},
  [51.51498, -0.1148634]],
 '2043123': ['https://www.openrent.co.uk/2043123',
  '£2,396 per month',
  '0.31  km',
  'studio flat',
  'WC2B',
  {'Bedrooms': '1', 'Bathrooms': '1', 'Max Tenants': '1'},
  [51.514355, -0.1211351]],
 '2016841': ['https://www.openrent.co.uk/2016841',
  '£5,633 per month',
  '0.35  km',
  '2 bed flat',
  'WC2R',
  {'Bedrooms': '2', 'Bathrooms': '2', 'Max Tenants': '4'},
  [51.511524, -0.1136339]]}
There has been an exception for 0 of the properties
 

We save the data to a JSON file so that we do not have to re-run the code to access it.

2.3. Adding Crime Information from Police.uk API¶

  • We can now use the coordinates we obtained in 2.2 to obtain data on the number of crimes reported within a mile radius of the property within a specific month, giving an impression of the relative safety of the property's location.
    • We could restrict this method to certain types of crime, but all crime types (given below) are likely to negatively impact living in the area. Another option would be to weight each type of crime, but doing this would be fairly arbitrary.
    • Another extension could be to get the number of crimes for a whole year and then get the average monthly number for that year, but due to the amount of time that would consume and the relatively low volatility in crime numbers, we simply use the latest available month. Even if there is some volatility, typically the crime numbers reach their peak around the end of the year and experience a drop in January, followed by gradual decrease until it starts increasing again around the middle of the year. As a result, January is a representative month of the monthly average.
      • This was found through extra analysis, which has been omitted for conciseness purposes.
  • We use the government-provided police.uk API to obtain this data.
  • First, we write a function that accepts coordinated and a desired year-month pair in 'YYYY-MM' form and returns the total number of crime reports given.
In [3]:
#create a function for getting a crimes number from given coordinates and month
def coord_crime_num_month(lat,long,month):
    
    #get all crime data from coordinates and every month, using police.uk API
    monthcrimecount = len(requests.get(f'https://data.police.uk/api/crimes-street/all-crime?lat={lat}&lng={long}&date={month}').json())
        
    #return the total number of crimes that data is given for
    return(monthcrimecount)

The function to get data for a whole year taking as input the coordinates and the year of interest is shown below for completeness, but note that we do not use it throughout the report.

In [4]:
#create a function for getting a crimes number from given coordinates and year
def coord_crime_num_year(lat,long,year):
    
    #get all crime data from coordinates and every month, using police.uk API, and sums them for total crimes in the year
    yearcrimecount = 0
    for month in range(1,13):
        yearcrimecount += len(requests.get(f'https://data.police.uk/api/crimes-street/all-crime?lat={lat}&lng={long}&date={year}-{month}').json())
        
    #return the total number of crimes that data is given for
    return(yearcrimecount)
  • Next we open the json file we saved in 2.2, use the above function to add crime data from the most recent month with data (Jan 2024), and save the updated file under a new name
    • Note that the code below takes approximately 40 minutes to execute.
In [ ]:
#open previous json file 
with open('data/private_property_info.json') as f:
    props_c = json.load(f)

#iterate through each property and extract coordinates
for prop in props_c: 
    coords = props_c[prop][-1]
    
    #define lat and long from coords
    lat = coords[0]
    long = coords[1]
    
    #run crime number API function using given coords, in most recent month available
    crime_num = coord_crime_num_month(lat,long,'2024-01')
    
    #append area crimerate to end of property info list
    props_c[prop].append(crime_num)

#save dictionary in a JSON file so that we have the data saved and we do not have to rerun every time
with open('data/private_property_info_with_crimes_number.json', 'w') as json_file:
    json.dump(props_c, json_file)
    
print('All done')
All done
 

We append the initial dictionary and save it to another JSON file again so that we can access it.

2.4. Scraping Data for LSE Halls¶

  • Let's also scrape the prices of single rooms in LSE halls available to undergraduates as a point of comparison for flat prices. We have taken single rooms to be the closest comparison to a room in a flat, as very few flat listings are for a shared bedroom, but many involve sharing a bathroom with a flatmate.
  • First we scrape the LSE accommodation 'search' page for the URLs for all relevant halls.
In [9]:
#create empty list to fill with URLs for each relevant hall.
url_list = []

#there are two pages of results from the query we submitted for the URL, so will need to iterate through both pages' URLs
for index in range(1,3):
    
    #uses f-string for url to substitute in the page number, then retrieves the page's 'soup'
    url = f'https://www.lse.ac.uk/student-life/accommodation/search-accommodation?collection=lse-accommodation&pageIndex={index}&roomType=451ee418-b4c2-4727-8ea8-735dd74f33ce&sort=metaavailability%20%20&studentType=95ff59b3-9e8a-4ad6-83bd-f4c3357b2def'
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    
    #iterate through each 'card', finds next 'a' element's href, which is the URL of the hall's webpage, then adds to the list
    for item in soup.find_all('h2',attrs={'class':"card__title"}):
        url = item.find_next('a')['href'].strip('.aspx')
        url_list.append(url)
  • Now we have a list of the URLs of all LSE halls that have single rooms available to undergraduate students, we can iterate through each and scrape all the revelant information, mirroring the information we obtained for the private rental properties previously.
  • We have to make use of the postcodes.io API to translate the accommodation's postcode, the most precise available location data on the webpage, into coordinates that can be fed into the crime number API function defined previously.
  • At the end we format the data as a PD dataframe and save it as a csv file, so that we can access the data without rerunning this code.
    • Note that the code below takes approximately 1.5 minutes to execute.
In [ ]:
#create empty pd dataframe to fill with all possibly useful info for each hall
halls_i = pd.DataFrame(columns=['hall','(avg) cost per week','contract length','(avg) contract cost','distance from LSE','crimes number','outcode'])

#iterate through each url in the list
for url in url_list:
    
    #retrieve soup for the url
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    
    #find name of hall from herobanner title
    name = soup.find('h1',attrs={'class':'heroBanner__title'}).get_text()
    
    #find distance from LSE from herobanner content
    distance_text = soup.find('div',attrs={'class':'accommKeyDetails__dist'}).get_text()
    distance = distance_text.strip('km').strip().split()[-1]
    
    #finds piece of text with contract length, the third word of which is always the contract length itself
    contract_text = soup.find('p',attrs={'class':'accommKeyDetails__contract'}).get_text()
    contract_length = int(contract_text.split()[2])
    
    #finds location of where the single room information is
    #had to use two checks as some are listed as 'Single room', others as 'Single Room '
    single_location = soup.find('h2',attrs={'class':'roomlist__title'},string='Single room') 
    if single_location == None:
        single_location = soup.find('h2',attrs={'class':'roomlist__title'},string='Single room ')
    
    #finds next 'room at a glance price' element, and extracts the text from the element
    price_info = single_location.find_next('p',attrs={'class':'roomataGlance__price'})
    price_info_text = str(price_info.get_text()).split()
    
    #some have a single price and some contain a range of prices
    #single prices have a length of 3, with the second word being the price per week
    if len(price_info_text)==3:
        price = price_info_text[1].strip('£')
        
    #ranges of prices have a length of 5, with the lower in the second and upper in the fourth
    #I took a mean of the upper and lower bound as a representative single price
    if len(price_info_text)==5:
        price = (float(price_info_text[1].strip('£'))+float(price_info_text[3].strip('£')))/2
    
    #calculates contract cost by multiplying contract length and price
    contract_cost = float(contract_length) * float(price)
    
    #find postcode so that a crimes number can be calculated
    address_text = soup.find('h2',attrs={'class':'_mce_tagged_br'}).find_next('p').get_text()
    postcode = address_text.split('London,')[1].strip()
    
    #from testing, I have found that the webpage for garden halls has a typo in the postcode - it reads WC1H 9EB, but should read WC1H 9EN.
    #we have created a line of code to correct this error.
    if postcode == 'WC1H 9EB':
        postcode = 'WC1H 9EN'
        
    #we will need outcode so that we can compare rough locations with rental properties
    outcode = postcode.split()[0]
    
    #use postcodes.io API to find coordinates for the given postcode
    api_response = requests.get(f'https://api.postcodes.io/postcodes/{postcode}').json()
    lat = api_response["result"]["latitude"]
    long = api_response["result"]["longitude"]
    
    #input coordinates into crimes number API function to return a crimes number
    crime_num = coord_crime_num_month(lat,long,'2024-01')
    
    #adds new row to the halls_i dataframe
    halls_i.loc[len(halls_i.index)] = [name,price,contract_length,contract_cost,distance,crime_num,outcode]

#save dataframe as a csv file so that it can be accessed without running code again
halls_i.to_csv('data/lse_accomm_info.csv', sep=',', index=False, encoding='utf-8')
 

Save the PD DataFrame as a CSV file to be accessed later.

3. Prepare the Data for Analysis¶

In this section we will take the JSON file containing property information and the PD dataframe containing information about LSE accommodation and merge them into a single PD dataframe, upon which we can later perform analysis.

3.1. Adding Private Property Data to a DataFrame¶

  • First we add the data from private house/flat rentals to the dataframe.
  • When calculating the price per bedroom, we should note that some listings provide information as if couples will be staying in the room - for instance, 2 bedrooms but 4 max tenants. Therefore, in this case we should calculate it as price per bedroom. Moreover, some are for a room in a shared flat - for example, a property has max tenant 1 but 3 bedrooms. Then we should calculate it as price/max number of tenants. => Calculate as total price/min{max tenants, bedrooms}
    • This calculation still will provide misleading information as some "Room in a Shared Flat/House" listings may provide the total number of bedrooms and the total number of tenants, e.g., 5 bedrooms, 5 max tenants. However, only 1 of the bedrooms is offered and the price stated is only for 1 bedroom. Therefore, for properties, the title of which contains "Shared", we take the stated price as the price per bedroom.
In [ ]:
#create empty pd dataframe to house all data
all_info = pd.DataFrame(columns=['Name','Outcode','Price per month','Price per bedroom per month','Price per bedroom per year','No. bedrooms','Distance from LSE','Crime No.','LSE accommodation'])

#open properties with crimes number json file as props_p
with open('data/private_property_info_with_crimes_number.json') as f:
    props_p = json.load(f)
    
#iterate through each property, gathering datapoints to be put into pd dataframe
for prop in props_p:
    name = props_p[prop][3]
    
    outcode = props_p[prop][4]
    
    #need to strip text out of price per month information, to obtain a number
    ppm_text = props_p[prop][1]
    ppm = int(ppm_text.strip('£').strip(' per month').replace(',',''))
    
    no_bedrooms = int(props_p[prop][5]['Bedrooms'])
    
    #create if statement to define whether house/flat is shared
    shared_house = False
    name_words = name.lower().split()
    if 'shared' in name_words:
        ppm_per_room = ppm
    else:
        ppm_per_room = ppm/no_bedrooms
        
    ppy_per_room = 12*ppm_per_room
        
    dist_lse = float(props_p[prop][2].split()[0])
    
    crime_num =  props_p[prop][7]
    
    #signal that property is not an LSE hall
    lse_accomm = False
    
    #append info to pd dataframe
    all_info.loc[len(all_info.index)] = [name,outcode,ppm,ppm_per_room,ppy_per_room,no_bedrooms,dist_lse,crime_num,lse_accomm]
    
all_info.head()
Out[ ]:
Name Outcode Price per month Price per bedroom per month Price per bedroom per year No. bedrooms Distance from LSE Crime No. LSE accommodation
0 3 bed flat WC2A 7800 2600.0 31200.0 3 0.17 5678 False
1 studio flat WC2B 2396 2396.0 28752.0 1 0.31 6949 False
2 2 bed flat WC2R 5633 2816.5 33798.0 2 0.35 5585 False
3 1 bed flat WC2A 2500 2500.0 30000.0 1 0.40 5134 False
4 1 bed flat EC4A 2600 2600.0 31200.0 1 0.48 4727 False

3.2. Adding LSE Accommodations to DataFrame¶

  • Next, we add the data we collected from the websites for LSE accommodation. As the data was created as a pd dataframe before being stored as a csv, we can iterate through each row and insert the data into the above dataframe.
In [ ]:
#open csv file as a pd dataframe
halls_p = pd.read_csv('data/lse_accomm_info.csv')
halls_p
Out[ ]:
hall (avg) cost per week contract length (avg) contract cost distance from LSE crimes number outcode
0 urbanest Westminster Bridge 297.995 39 11621.805 1.5 3219 SE1
1 College Hall 289.730 40 11589.200 1.2 7655 WC1E
2 International Hall 266.280 40 10651.200 1.0 6247 WC1N
3 Bankside House 259.700 39 10128.300 1.5 2392 SE1
4 Carr-Saunders Hall 257.250 31 7974.750 1.6 7544 W1T
5 Connaught Hall 273.630 40 10945.200 1.3 6760 WC1H
6 High Holborn Residence 317.800 39 12394.200 0.5 7225 WC1V
7 Passfield Hall 252.875 31 7839.125 1.5 6240 WC1H
8 Nutford House 250.180 40 10007.200 3.1 2803 W1H
9 Rosebery Hall 255.850 39 9978.150 1.6 2646 EC1R
10 The Garden Halls 279.195 40 11167.800 1.6 4778 WC1H
  • Now that the dataframe containing halls information is loaded, we can add each row to the all_info dataframe.
In [13]:
#iterate through the content of each row of the halls dataframe
for index,row in halls_p.iterrows():
    
    #label each piece of content from row by the way it will be addes to all_info dataframe
    
    name = row['hall']
    outcode = row['outcode']
    ppy = int(row['(avg) contract cost'])
    dist_lse = row['distance from LSE']
    crime_num = row['crimes number']
    
    #calculate ppm as price per week multiplied by avg weeks per month
    ppm = int(row['(avg) cost per week']*(52/12))
    ppm_per_bedroom = ppm
    
    #all of these rows are for 1 bedroom, and are a part of lse halls 
    no_bedrooms = 1
    lse_accomm = True
    
    #append row to end of all_info dataframe
    all_info.loc[len(all_info.index)] = [name,outcode,ppm,ppm_per_bedroom,ppy,no_bedrooms,dist_lse,crime_num,lse_accomm]
In [15]:
#As can be seen below, the LSE accommodations are added to the bottom of the pandas df:
all_info[-13:]
Out[15]:
Name Outcode Price per month Price per bedroom per month Price per bedroom per year No. bedrooms Distance from LSE Crime No. LSE accommodation
751 1 bed flat SE5 3990 3990.0 47880.0 1 4.0 1388 False
752 2 bed flat SW11 4000 2000.0 24000.0 2 4.0 1124 False
753 urbanest Westminster Bridge SE1 1291 1291.0 11621.0 1 1.5 3219 True
754 College Hall WC1E 1255 1255.0 11589.0 1 1.2 7655 True
755 International Hall WC1N 1153 1153.0 10651.0 1 1.0 6247 True
756 Bankside House SE1 1125 1125.0 10128.0 1 1.5 2392 True
757 Carr-Saunders Hall W1T 1114 1114.0 7974.0 1 1.6 7544 True
758 Connaught Hall WC1H 1185 1185.0 10945.0 1 1.3 6760 True
759 High Holborn Residence WC1V 1377 1377.0 12394.0 1 0.5 7225 True
760 Passfield Hall WC1H 1095 1095.0 7839.0 1 1.5 6240 True
761 Nutford House W1H 1084 1084.0 10007.0 1 3.1 2803 True
762 Rosebery Hall EC1R 1108 1108.0 9978.0 1 1.6 2646 True
763 The Garden Halls WC1H 1209 1209.0 11167.0 1 1.6 4778 True
In [ ]:
all_info.to_csv('data/private_and_lse_final_info.csv', sep=',', index=False, encoding='utf-8')

Save the PD DataFrame with all the info we need for analysis in a CSV file.

4. Data Analysis¶

In [2]:
# Import extra libraries for visualisation and analysis:
import seaborn as sns
import plotly.express as px
import statsmodels.formula.api as sm
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
In [ ]:
all_i = pd.read_csv('data/private_and_lse_final_info.csv')
display(all_i)
Name Outcode Price per month Price per bedroom per month Price per bedroom per year No. bedrooms Distance from LSE Crime No. LSE accommodation
0 3 bed flat WC2A 7800 2600.0 31200.0 3 0.17 5678 False
1 studio flat WC2B 2396 2396.0 28752.0 1 0.31 6949 False
2 2 bed flat WC2R 5633 2816.5 33798.0 2 0.35 5585 False
3 1 bed flat WC2A 2500 2500.0 30000.0 1 0.40 5134 False
4 1 bed flat EC4A 2600 2600.0 31200.0 1 0.48 4727 False
... ... ... ... ... ... ... ... ... ...
759 High Holborn Residence WC1V 1377 1377.0 12394.0 1 0.50 7225 True
760 Passfield Hall WC1H 1095 1095.0 7839.0 1 1.50 6240 True
761 Nutford House W1H 1084 1084.0 10007.0 1 3.10 2803 True
762 Rosebery Hall EC1R 1108 1108.0 9978.0 1 1.60 2646 True
763 The Garden Halls WC1H 1209 1209.0 11167.0 1 1.60 4778 True

764 rows × 9 columns

4.1. Summary Statistics and Final Data Corrections¶

  • Our DataFrame includes *ordinal qualitative* - the boolean value whether it is LSE accommodation or not, *nominal qualitative* - the name of the property and the outcode, *discrete quantitative* - number of bedrooms and crime number, and *continuous quantitative* - all else.
  • Let's first observe the info about the data with .info() and the summary statistics with .describe().
  • We also check whether we have empty cells - "None" values:
In [4]:
display(all_i.info())
display(all_i.describe(include="all").round(2)) # Round for readability
display(all_i.isna().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 764 entries, 0 to 763
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Name                         764 non-null    object 
 1   Outcode                      764 non-null    object 
 2   Price per month              764 non-null    int64  
 3   Price per bedroom per month  764 non-null    float64
 4   Price per bedroom per year   764 non-null    float64
 5   No. bedrooms                 764 non-null    int64  
 6   Distance from LSE            764 non-null    float64
 7   Crime No.                    764 non-null    int64  
 8   LSE accommodation            764 non-null    bool   
dtypes: bool(1), float64(3), int64(3), object(2)
memory usage: 48.6+ KB
None
Name Outcode Price per month Price per bedroom per month Price per bedroom per year No. bedrooms Distance from LSE Crime No. LSE accommodation
count 764 764 764.00 764.00 764.00 764.00 764.00 764.00 764
unique 35 68 NaN NaN NaN NaN NaN NaN 2
top 2 bed flat NW1 NaN NaN NaN NaN NaN NaN False
freq 198 93 NaN NaN NaN NaN NaN NaN 753
mean NaN NaN 2638.11 1814.00 21713.52 2.09 2.80 2639.82 NaN
std NaN NaN 1436.35 856.71 10331.14 1.22 0.87 1493.49 NaN
min NaN NaN 650.00 650.00 7800.00 1.00 0.17 1124.00 NaN
25% NaN NaN 1500.00 1200.00 14400.00 1.00 2.19 1764.00 NaN
50% NaN NaN 2449.00 1586.17 19034.00 2.00 3.03 2249.00 NaN
75% NaN NaN 3300.50 2275.00 27300.00 3.00 3.48 2631.25 NaN
max NaN NaN 8333.00 8333.00 99996.00 7.00 4.00 7873.00 NaN
Name                           0
Outcode                        0
Price per month                0
Price per bedroom per month    0
Price per bedroom per year     0
No. bedrooms                   0
Distance from LSE              0
Crime No.                      0
LSE accommodation              0
dtype: int64
  • We can see that we do not have any missing values for any of the data of the 764 properties analysed.
  • Notice that the maximum Distance from LSE is in accordance to our filtering - 4. However, the maximum number of bedrooms is 7, but we filtered for that to be 6. Let's find these properties and figure out why is that:
In [ ]:
# Open the initial json file with the long dictionary to be able to see the links and all other info: 
with open('data/private_property_info.json') as f:
    props_check = json.load(f)

for prop in props_check:
    if int(props_check[prop][-2]["Bedrooms"]) > 6:
        display(props_check[prop])
['https://www.openrent.co.uk/2022187',
 '£1,296 per month',
 '1.61  km',
 'room in a shared flat',
 'SE1',
 {'Bedrooms': '7', 'Bathrooms': '7', 'Max Tenants': '1'},
 [51.501884, -0.1038921]]
['https://www.openrent.co.uk/2018779',
 '£900 per month',
 '3.78  km',
 'room in a shared house',
 'NW8',
 {'Bedrooms': '7', 'Bathrooms': '5', 'Max Tenants': '1'},
 [51.525555, -0.1680249]]
  • We can see that there are 2 properties with more than 6 bedrooms. By observing their characteristics we can see that these are actually shared bedrooms in a 7-bedroom flat and simply the website's filtering does not count these as 7 bedrooms as only one is available for rent.
    • Since we would like to narrow down our analysis to maximum 6-bedroom flats, we will exclude these 2 properties as generally our assumption is that students would consider sharing a flat with maximum 5 other people.
In [6]:
all_i = all_i[all_i["No. bedrooms"] <= 6]
all_i.describe().round(2)
Out[6]:
Price per month Price per bedroom per month Price per bedroom per year No. bedrooms Distance from LSE Crime No.
count 762.00 762.00 762.00 762.00 762.00 762.00
mean 2642.15 1815.88 21735.93 2.07 2.80 2641.45
std 1436.02 856.99 10334.70 1.19 0.87 1494.97
min 650.00 650.00 7800.00 1.00 0.17 1124.00
25% 1517.75 1200.00 14400.00 1.00 2.19 1765.00
50% 2450.00 1589.50 19074.00 2.00 3.03 2249.00
75% 3301.50 2275.00 27300.00 3.00 3.48 2631.75
max 8333.00 8333.00 99996.00 6.00 4.00 7873.00
  • Now everything is according to our initial assumptions and we have 762 (=764-2) properties to analyse.
  • The standard deviation of the price columns is quite significant, but so are the standard deviations of the number of bedrooms, distance from LSE and number of crimes. Let's investigate the effect of these factors on the price:

4.2. General Correlation¶

In [7]:
# Make the display panel fit the whole graph:
from IPython.display import display, HTML
display(HTML("<style>.output { height: auto !important; }</style>"))

# Find the correlations of the columns corresponding to price per bedroom per month, crimes, no. of bedrooms and distance from LSE.
# Exclude LSE accommodations for now.
corr_matrix = all_i.loc[all_i['LSE accommodation']==False, ['Price per bedroom per month', "No. bedrooms", "Distance from LSE", "Crime No."]].corr()
# Visualise the matrix via a heatmap - different shade of the colour depending of the significance of the correlation.

plt.figure(figsize=(6, 5))

# Create the heatmap and add the values and round them to 2 decimal places.
# Also center around 0 so that values of negative correlations and positive ones are shades of different colours:
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", center=0, annot_kws = {'size': 8, 'fontweight': "bold"}, vmin=-1) # The last parameter is so that we set the min value of the bar.
plt.xticks(rotation = 30, ha = "right", fontsize = 10)
plt.yticks(rotation = 0, fontsize = 10)
plt.title('Correlation Matrix of the Data', fontsize = 10, fontweight='bold')
plt.tight_layout()

plt.show()
No description has been provided for this image
  • We observe notable negative correlation between the price per bedroom of flats and number of bedrooms in the flat, i.e. the higher the number of bedrooms, the cheaper it is for a single bedroom. We also note that flats with higher number of bedroom tend to be on average further from LSE campus and in safer areas; this correlation is, however, relatively small - -0.1 and 0.1 respectively.
  • Furthermore, as expected, longer distance from LSE is associated with lower price, but that relationship is not too significant. Something important to note is that distance from LSE is highly negatively correlated with crime numbers, implying that LSE is generally in a less safe area.
    • Potentially due to that confounder (distance from LSE), we observe the positive correlation between crime number and price.

4.3. Effects of no. bedrooms on cost per bedroom¶

One of the questions we asked was what effect does the number of bedrooms in a flat have on the price per bedroom of the flat. To observe this, we use a boxplot to show the distribution of price per bedroom per month by total number of bedrooms in the flat:

In [8]:
plt.figure(figsize=(8, 6))
# We plot only the listed properties on OpenRent, so we exclude the LSE accommodations here:
ax = sns.boxplot(all_i[all_i['LSE accommodation']==False],x='No. bedrooms',y='Price per bedroom per month',palette = 'tab10')
ax.set_title('Distribution of price per bedroom per month by number of bedrooms in property', fontdict={'fontsize': 12, 'fontweight': 'bold'})
ax.set_ylabel("Price per bedroom per month (£)", fontsize = 12)
ax.set_xlabel("No. bedrooms", fontsize = 12)
ax.grid(axis='y', linestyle='--', alpha=0.5)
ax.plot()
Out[8]:
[]
No description has been provided for this image

From this we can observe that the overall distribution of price per bedroom per month appears to become 'lower' for each additional bedroom added, but the lower bound of each distribution is still in approximately the same position - below £1000. The higher upper quartiles and upper outliers of the properties with fewer bedrooms are likely a result of 'luxury' properties with higher rent typically having fewer bedrooms.

It is important to note that *4-bedroom flats* seem to be on the lowest end - as can be seen in the summary statistics table below - they have the lowest median and the lowest standard deviation among all No.-bedrooms properties except the 6-bedroom ones. However, we have only 9 observations for these, so we cannot draw firm conclusions about the accuracy of their distribution.

In [11]:
all_i[['No. bedrooms', 'Price per bedroom per month']].groupby('No. bedrooms').describe().round(2)
Out[11]:
Price per bedroom per month
count mean std min 25% 50% 75% max
No. bedrooms
1 309.0 2347.18 932.79 700.0 1795.0 2250.0 2730.00 8333.00
2 239.0 1684.15 594.82 866.5 1300.0 1500.0 1897.75 4166.50
3 112.0 1290.78 422.48 700.0 1000.0 1200.0 1433.33 2777.67
4 63.0 1058.94 264.39 650.0 862.5 1000.0 1150.00 2000.00
5 30.0 1224.33 446.09 715.0 997.0 1150.0 1267.25 2383.00
6 9.0 877.81 45.36 860.0 860.0 860.0 860.00 997.00
  • It worth noting the number of available flats with each number of bedrooms, shown below. There are hundreds of available flats with 1, 2 or 3 bedrooms, but very few with 6 bedrooms.
  • It is also important to consider that price per bedroom does not take into account the share of shared spaces that is being paid for. The fewer the number of tenants, the greater the share of shared spaces that is being paid for by the rent, so there is additional 'value' to living in a property with fewer bedrooms.
    • A possible extension is to take the square meters of the whole flat and of the room itself and to develop a model which accounts for the respective proportion of the common areas.
  • Let's visualise the numbers of available properties depending on the number of bedrooms and the respective percentages from the total number of properties to gain an idea about the representation of each type in our dataset:
In [12]:
fig, ax = plt.subplots(1, 2, figsize=(16, 8))

ax_1 = sns.barplot(data = all_i[all_i['LSE accommodation']==False]['No. bedrooms'].value_counts().reset_index(),x='No. bedrooms',y='count', ax = ax[0], palette = 'tab10')
for i in ax_1.containers:
    ax_1.bar_label(i,)
ax_1.set_ylabel('No. available properties', fontsize = 12)
ax_1.set_xlabel("No. bedrooms", fontsize = 12)
ax_1.set_title('Number of Available Properties by No. of Bedrooms', fontdict={'fontsize': 14, 'fontweight': 'bold'})

ax[1].axis("equal")
ax[1].set_title('Percentage (%) of Available Properties by No. of Bedrooms', fontsize = 14, fontweight = 'bold')

labels = all_i[all_i['LSE accommodation']==False]["No. bedrooms"].value_counts().index
numbers = all_i[all_i['LSE accommodation']==False]["No. bedrooms"].value_counts().values

sections, nums, percentages = ax[1].pie(numbers, autopct = '%1.1f%%')
ax[1].legend(labels, loc='upper right', fontsize = 12)

for percentage in percentages:
    percentage.set_fontsize(12)
    percentage.set_fontweight('bold')

plt.show()
No description has been provided for this image

4.4. Number of affordable properties and safety by postcode¶

We also set out to find whether there are postcodes which have more available affordable properties. Below is a bar plot of the number of properties each outcode has that have a price per bedroom per month at or below £1400 (the highest LSE accommodation single room price is £1377).

In [13]:
plt.figure(figsize=(10, 6))

affordable_props = all_i[all_i['LSE accommodation']==False][all_i['Price per bedroom per month']<1400][['Outcode','Price per bedroom per month']].groupby('Outcode').count().reset_index().sort_values('Price per bedroom per month')
affordable_props_crimerate = all_i[['Outcode','Crime No.']].groupby('Outcode').mean()
affordable_props_combined = affordable_props.join(affordable_props_crimerate, how='left', on='Outcode').rename({'Price per bedroom per month':'Number of Available Properties below £1400 per Bedroom','Crime No.':'Avg Crime No. in Outcode'},axis=1)

ax = sns.barplot(affordable_props_combined,x='Outcode',y='Number of Available Properties below £1400 per Bedroom',hue = 'Avg Crime No. in Outcode')

ax.set_xticklabels(list(affordable_props_combined['Outcode']), rotation=90)
ax.set_xlabel("Outcode", fontsize = 12)
ax.set_ylabel("Number of Properties below £1400 per Bedroom", fontsize = 12)
ax.set_title("Number of Affordable Properties by Area", fontsize = 14, fontweight = 'bold')
ax.grid(axis='y', linestyle='--', alpha=0.5)
ax.plot()
Out[13]:
[]
No description has been provided for this image

Let's observe the crime numbers for the 5 areas with highest number of affordable properties:

In [27]:
affordable_props_combined.sort_values(by='Number of Available Properties below £1400 per Bedroom', ascending=False).reset_index().round(2)[:5]
Out[27]:
index Outcode Number of Available Properties below £1400 per Bedroom Avg Crime No. in Outcode
0 14 NW1 44 2341.62
1 11 N1 43 2194.99
2 16 SE1 33 2282.69
3 0 E1 28 2361.83
4 22 SW1V 21 1530.25

4.5. Effects of distance from LSE on price¶

4.5.1. Regression analysis¶
  • A useful tool for determining the effects of various components on a single resulting variable is regression analysis. In this application, as the independent variables we take the price per bedroom per month as the dependent variable, and no. bedrooms, distance from LSE, crimes number and a dummy variable taking the value 1 if the property is LSE accommodation.
  • To do this, first we create a dataframe that only includes the data that will be used for the regression. We take the columns mentioned above, remap the LSE accommodation column so that it takes the value 1 and 0 instead of True and False for simplicity in the regression, and rename the columns to remove any whitespace.
In [14]:
ols_df = all_i[['Price per bedroom per month','No. bedrooms','Distance from LSE','Crime No.','LSE accommodation']]
ols_df['LSE accommodation'] = ols_df['LSE accommodation'].astype('int')
ols_df = ols_df.rename(columns={'Price per bedroom per month':'price_per_bedroom_per_month','No. bedrooms':'no_bedrooms','Distance from LSE':'distance_from_LSE','Crime No.':'crime_no','LSE accommodation':'LSE_accommodation'})
  • We can now use statsmodels' ols functions to run regression on the dataframe created above.
In [15]:
result = sm.ols(formula='price_per_bedroom_per_month ~ no_bedrooms + distance_from_LSE + crime_no + LSE_accommodation', 
                data=ols_df).fit(cov_type='HC0') # Use robust standard errors
print(result.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:     price_per_bedroom_per_month   R-squared:                       0.308
Model:                                     OLS   Adj. R-squared:                  0.304
Method:                          Least Squares   F-statistic:                     132.3
Date:                         Thu, 02 May 2024   Prob (F-statistic):           1.15e-85
Time:                                 10:33:11   Log-Likelihood:                -6086.8
No. Observations:                          762   AIC:                         1.218e+04
Df Residuals:                              757   BIC:                         1.221e+04
Df Model:                                    4                                         
Covariance Type:                           HC0                                         
=====================================================================================
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept          2684.0227    195.019     13.763      0.000    2301.793    3066.252
no_bedrooms        -376.5173     20.821    -18.084      0.000    -417.326    -335.709
distance_from_LSE   -62.7628     44.895     -1.398      0.162    -150.756      25.230
crime_no              0.0400      0.036      1.114      0.265      -0.030       0.110
LSE_accommodation -1241.6619     85.828    -14.467      0.000   -1409.882   -1073.441
==============================================================================
Omnibus:                      427.044   Durbin-Watson:                   1.776
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5193.308
Skew:                           2.273   Prob(JB):                         0.00
Kurtosis:                      14.954   Cond. No.                     2.62e+04
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC0)
[2] The condition number is large, 2.62e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

It is important to recognise that there is no real interpretation to the intercept here. It is the hypothetical price per bedroom of a zero bedroom flat located at LSE's campus in an area without crimes, and thus has no real world meaning.

4.5.2. Regression results¶

This result shows that, holding all other variables equal, on average for each additional kilometre from LSE a property is, the price per bedroom per month reduces by £58. This is perhaps less than we would expect, and so perhaps it may be worth it for students to pick closer properties, as they may be able to save money by walking instead of spending money on public transport.

However, this coefficient will vary by the direction from LSE where the property is located - properties to the west of LSE are closer to Covent Garden/Soho, so are typically more expensive, and the same is true for properties to the east, which is closer to the City of London. This implies that the properties to the north and south of LSE, viewed as less desirable areas than central London, likely are more than an average of £58 per bedroom per month less per kilometre further away.

In addition, this coefficient is likely inaccurate as a result of the high multicollinearity between distance from LSE and crime numbers, as earlier seen in the correlation matrix by a correlation coefficient of -0.7. The areas closest to LSE have higher crime numbers, meaning there is little variance between the two regressors. In this situation, when we care about one of the coefficients instead of the two both being controls, the solution is to find a more parsimonious solution by dropping one of the controls.

There is likely to be some reverse causality present in the crimes number coefficient, as crimes numbers are likely to be higher in areas with higher rental costs, as the criminals are aware of the higher rewards from crime in rich areas (luxury goods to burgle, more money in wallets etc.). This is likely to be the reason for both the positive coefficient on the crime number variable, and the high correlation between crime number and distance from LSE. For this reason, crime number will be dropped from the next regression.

4.5.3. Further regression evaluation¶

There is one important 'omitted variable' that we omit in this regression, which is the 'luxury' of the property. This is likely to create bias, as it is certainly positively correlated with prices, and highly likely to be correlated with some of the regressors.

For example, 'luxury' is typically associated with a lower number of bedrooms, with those bedrooms taking up more space. The omission of luxury is likely to cause negative bias in the coefficient on the number of bedrooms. Indeed, we see that the coefficient on the number of bedrooms is -376, meaning that on average, for each additional bedroom in the property the price per bedroom per month reduces by £376. This is likely inflated by this OVB.

The lack of a way to measure the luxury of property is also likely a reason as to why the negative coefficient on LSE is so large - whilst we do expect the LSE accommodation to be cheaper than comparable private rental properties, the average reduction of £985 per month is higher than we expect. This is likely a result of the average property being more luxurious than LSE's halls rooms, hence priced higher.

Additionally, there is likely to be some reverse causality present in the crimes number coefficient, as crimes numbers are likely to be higher in areas with higher rental costs, as the criminals are aware of the higher rewards from crime in rich areas (luxury goods to burgle, more money in wallets etc.). This is likely to be the reason for the positive coefficient on the crime number variable.

Typically to reduce these biases, we would include 'luxury' as a regressor, and use instrumental variables or 2 stage least squares to take out the reverse causality in crimes number. However, in this situation, we are restricted by the fairly limited data we have collected, and by the lack of any thorough numeric way to measure the luxury of a property.

It could be beneficial to restrict our data to the more 'affordable' observed properties, in order to attempt to remove 'high luxury' properties. However, this would mean excluding some properties that are expensive for reasons other than luxury, and keep in some more luxury properties further from LSE.

4.5.4. Regression excluding crime numbers and properties over £2000¶
  • The most expensive type of room in LSE accommodation is just below £2000 per month (Urbanest Westminster Bridge Student Accommodation - between £420.03 and £458.32 a week for a Single Studio). As a result, we restrict the properties to below £2000 so that we have strong representation among the private properties of what one can get in LSE accommodation. Although we minimise bias and maximise the relevance, there are still some limitations to that analysis, but more on that in section 5.2.
In [16]:
ols_df_afford = all_i[all_i['Price per bedroom per month']<2000][['Price per bedroom per month','No. bedrooms','Distance from LSE','Crime No.','LSE accommodation']]
ols_df_afford['LSE accommodation'] = ols_df_afford['LSE accommodation'].astype('int')
ols_df_afford = ols_df_afford.rename(columns={'Price per bedroom per month':'price_per_bedroom_per_month','No. bedrooms':'no_bedrooms','Distance from LSE':'distance_from_LSE','Crime No.':'crime_no','LSE accommodation':'LSE_accommodation'})
In [17]:
result = sm.ols(formula='price_per_bedroom_per_month ~ no_bedrooms + distance_from_LSE + LSE_accommodation', 
                data=ols_df_afford).fit(cov_type='HC0') # Use robust standard errors
print(result.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:     price_per_bedroom_per_month   R-squared:                       0.349
Model:                                     OLS   Adj. R-squared:                  0.345
Method:                          Least Squares   F-statistic:                     115.5
Date:                         Thu, 02 May 2024   Prob (F-statistic):           7.99e-57
Time:                                 10:33:16   Log-Likelihood:                -3525.1
No. Observations:                          503   AIC:                             7058.
Df Residuals:                              499   BIC:                             7075.
Df Model:                                    3                                         
Covariance Type:                           HC0                                         
=====================================================================================
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept          1890.9553     42.478     44.516      0.000    1807.701    1974.210
no_bedrooms        -155.6817      9.788    -15.905      0.000    -174.866    -136.497
distance_from_LSE   -56.7037     13.270     -4.273      0.000     -82.712     -30.696
LSE_accommodation  -469.2789     32.730    -14.338      0.000    -533.429    -405.129
==============================================================================
Omnibus:                        0.936   Durbin-Watson:                   1.815
Prob(Omnibus):                  0.626   Jarque-Bera (JB):                1.020
Skew:                           0.060   Prob(JB):                        0.600
Kurtosis:                       2.815   Cond. No.                         29.6
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC0)

We now have a regression that is more likely to give representative results. All coefficients have a statistically significant t-stat despite the fewer observations, and we no longer have a warning for multicollinearity.

Again, of course the intercept has no interpretation, especially given the truncated data.

The issue we are likely to have is that the truncation will have removed some non-luxury properties in more expensive areas, and kept in some more luxury properties in further out areas. This will go some way to explain the lower coefficient on the distance from LSE regressor.

The coefficient on number of bedrooms is likely more accurate, as it is comparing more like-for-like properties in terms of luxury. This coefficient implies that on average, for each additional room a flat has, the average price per bedroom per month reduces by £156.

The coefficient on LSE accommodation is also likely more accurate here. It suggests that LSE accomdation is on average £469 per bedroom per month cheaper than privately rented property, holding fixed distance from LSE and number of bedrooms. A limitation to this is that LSE accommodation is considered as 1-bedroom for the purposes of the regression, but in reality it is rental of a single bedroom in a much larger building, with shared amenities such as kitchens and bathrooms. This means the coefficient is the difference between LSE accommodation and single bedroom flats per month. Taking into account the coefficient on number of bedrooms, it possible that flats with 3 or more bedrooms may be more affordable than LSE accommodation per person per month. This is assuming that number of bedrooms affects price per bedroom per month linearly, when in reality the change from 1 to 2 bedrooms likely has a much bigger effect on price per bedroom per month than the change from 3 to 4 or from 5 to 6.

Overall, whilst this regression has resolved some of the issues of the initial regression, it is still flawed, and so we must be cautious in taking its results at face value.

4.6. Comparison to LSE accommodation¶

Now let's analyse the difference between LSE accommodation and private flats:

4.6.1. Regression Analysis and Results¶
  • Let's first run a regression of distance from LSE on price by controlling only for LSE accommodation and including the interaction between these 2 variables, so that we see the impact of distance on price separately between private properties and LSE accommodations.
    • We should expect to observe the distance to better explain the price changes for LSE accommodation than for private properties. The reason for this is that LSE accommodation is only for LSE students, for whom one of the main factors to choose accommodation is distance from campus. Whereas, there are many potential renters for private properties with many different preferences. Therefore, for them distance from LSE has less weight in determining the market prices.
      • We can make conclusions about that based on the different R^2.
In [18]:
ols_df_afford_i = all_i[all_i['Price per bedroom per month']<2000][['Price per bedroom per month','No. bedrooms','Distance from LSE','Crime No.','LSE accommodation']]
ols_df_afford_i['LSE accommodation'] = ols_df_afford_i['LSE accommodation'].astype('int')
ols_df_afford_i['interaction'] = ols_df_afford_i['LSE accommodation'] * ols_df_afford_i['Price per bedroom per month']
ols_df_afford_i = ols_df_afford_i.rename(columns={'Price per bedroom per month':'price_per_bedroom_per_month','No. bedrooms':'no_bedrooms','Distance from LSE':'distance_from_LSE','Crime No.':'crime_no','LSE accommodation':'LSE_accommodation'})
In [19]:
result_i = sm.ols(formula='price_per_bedroom_per_month ~ distance_from_LSE + LSE_accommodation + interaction', 
                  data=ols_df_afford_i).fit(cov_type='HC0') # Use robust standard errors
print(result_i.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:     price_per_bedroom_per_month   R-squared:                       0.039
Model:                                     OLS   Adj. R-squared:                  0.033
Method:                          Least Squares   F-statistic:                     43.81
Date:                         Thu, 02 May 2024   Prob (F-statistic):           3.81e-25
Time:                                 10:33:21   Log-Likelihood:                -3623.0
No. Observations:                          503   AIC:                             7254.
Df Residuals:                              499   BIC:                             7271.
Df Model:                                    3                                         
Covariance Type:                           HC0                                         
=====================================================================================
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept          1550.7842     50.403     30.768      0.000    1451.997    1649.572
distance_from_LSE   -73.0227     16.969     -4.303      0.000    -106.281     -39.765
LSE_accommodation -1075.0654    156.209     -6.882      0.000   -1381.230    -768.901
interaction           0.6895      0.135      5.093      0.000       0.424       0.955
==============================================================================
Omnibus:                       32.996   Durbin-Watson:                   1.731
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               16.228
Skew:                           0.247   Prob(JB):                     0.000299
Kurtosis:                       2.272   Cond. No.                     1.58e+04
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC0)
[2] The condition number is large, 1.58e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
  • As can be seen on the scatterplot below, there is less variation in Y (price) and thus in the standard error for lower X (Distance from LSE), and more variation for higher X.
    • This is the case because LSE is in a central area where one can hardly find cheaper properties, while if we expand the circle, we can find both more expensive and cheaper properties. This is because the lower the distance from LSE, the lower the different number of areas analysed. However, when we increase the radius we observe many areas - some of them cheaper on average due to different reasons, some of them expensive.
      • For example, we can see that within a 0.5km radius of LSE there are no properties below £2000 per bedroom per month. This is because that is considering mostly properties in LSE's area (outcode), where on average it is expensive to live.
      • When we increase the radius to say 1.5km, for instance, we may be considering a few more areas - Covent Garden, Farringdon, South Bank, etc. While there are some more affordable properties in these areas, it is generally still quite expensive.
      • But if we look at areas on the 4km circumference, these may be in Belgravia and Westminster (quite expensive areas to live in), but also in Elephant&Castle and Walworth (cheaper on average areas).
    • This causes more variation with higher distance mainly due to number of areas observed, which is a strong indication of *heteroskedasticity*. As a result, we should use robust standard errors, which we incorporate with the argument cov_type='HC0' within the regressions.
4.6.2. Visualisation of regressed dataset¶
  • Let's first visualise all data points for the regression excluding properties over £2000. We use plotly.express so that we can hover the scattered data and the lines of best fit in order to see the name of the property and more about the data:
    • Note that if the graph below does not show, one can remove or add the argument 'notebook' in graph.show() depending on the platform one uses. In the repository of this projet one can also find the HTML file with the output from executing the code.
In [20]:
import plotly.express as pltx

afford_lse_accomm = all_i[all_i['Price per bedroom per month']<2000][all_i['LSE accommodation']]
afford_priv_prop = all_i[all_i['Price per bedroom per month']<2000][all_i['LSE accommodation'] == False]

graph = pltx.scatter(all_i[all_i['Price per bedroom per month']<2000], x = "Distance from LSE", y = "Price per bedroom per month", 
                     trendline='ols', hover_data=["Name", "Outcode"], color="LSE accommodation",
                     title = "<b>Comparison between prices of privately rented property and LSE accommodation per month</b>")

# Find the slope of the ols line of best fit and add annotation:
slope, _ = np.polyfit(afford_priv_prop["Distance from LSE"], afford_priv_prop["Price per bedroom per month"], 1)

graph.add_annotation(x = 0.7, y = 1500, text=f"<b>Slope = {slope:.2f}</b>", showarrow=True, arrowhead=2, ax = -20, ay = -80, font = dict(size = 12))
graph.add_shape(type="line", x0=0.4, y0=1200, x1=1.5, y1=1200, line=dict(color="black", width=1))
graph.add_shape(type="line", x0=1.5, y0=1200, x1=1.5, y1=590, line=dict(color="black", width=1))

graph.update_layout(title_x = 0.5) # Center the title

graph.show('notebook')
  • The slopes lines of best fit that we plot are the average change in price for an extra km from LSE separately for private properties and for LSE accommodations.
  • R^2 for LSE accommodations is approximately 0.40 while for private properties it is 0.03. This confirms our hypothesis from earlier that LSE accommodation prices more consistently decrease with distance from campus increasing again because LSE accommodations revolve around LSE, while distance from LSE may not be such an important factor in setting private properties' prices as they are not only for LSE students.
  • Note that generally LSE accommodations are on the lower bound of price given a certain distance from campus with 1 small exception - the outlier Nutford House (can be seen by hovering over the furthest LSE accommodation from campus - marked in red). Excluding that we can see that up to 1.5km from LSE, there aren't many properties below £1200, and yet all of these except one are LSE accommodations (the area is bounded by the black lines above).
4.6.3. Visualisation of the whole dataset¶
In [18]:
plt.figure(figsize=(12, 8))
ax = sns.scatterplot(data = all_i, x = 'Distance from LSE', y='Price per bedroom per month',
                hue = 'LSE accommodation',style = 'LSE accommodation',size = 'LSE accommodation',sizes = (250,50))
ax.set_xlabel("Distance from LSE (KM)", fontsize = 12)
ax.set_ylabel("Price per bedroom per month (£)", fontsize = 12)
ax.set_title("Comparison between prices of privately rented property and LSE accommodation per month", fontsize = 14, fontweight = 'bold')
ax.grid(axis='y', linestyle='--', alpha=0.5)
ax.plot()
Out[18]:
[]
No description has been provided for this image

From this plot, we can see that the price per bedroom per month for LSE halls (shown with orange crosses above) is among the cheapest options per bedroom per month for the distance from LSE it is situated at. Whilst there are some cheaper options, it is worth considering that the prices for LSE accommodation also include utility bills, and in some halls catering too.

It is also important to remember that LSE halls operate on 39 or 40 week contracts, where students do not live in or pay for the halls over the summer break. Whilst inconvenient for the smaller percentage of students who wish to stay in London over the summer, this saves money for the large majority of students who travel back to their families for the summer. Next, we compare the overall yearly costs between privately rented property and LSE accommodation per year.

In [19]:
plt.figure(figsize=(12, 8))
ax = sns.scatterplot(data = all_i, x = 'Distance from LSE', y='Price per bedroom per year',
                hue = 'LSE accommodation',style = 'LSE accommodation',size = 'LSE accommodation',sizes = (250,50))
ax.set_xlabel("Distance from LSE (KM)", fontsize = 12)
ax.set_ylabel("Price per bedroom per year (£)", fontsize = 12)
ax.set_title("Comparison between prices of privately rented property and LSE accommodation per year", fontsize = 14, fontweight = 'bold')
ax.grid(axis='y', linestyle='--', alpha=0.5)
ax.plot()
Out[19]:
[]
No description has been provided for this image

Here, we can see that the price for the overall contract for LSE accommodation is amongst the very lowest available of the property data we collected, showing the issues for students who must face moving out of LSE accommodation for their second year.

It is important to remember that this does not take into account the lack of accommodation over the summer, but this is of little importance for the majority of students. In addition, there is also the benefit to LSE halls of no utility bills and catering being included. Overall, it is easy to see the value proposition of LSE accommodation.

5. Conclusion¶

5.1. Summary of Results¶

  • Average price per bedroom per month decreases with additional bedrooms up to the 4-bedroom flats, after which there is a small spike in prices for 5-bedroom flats, and 6-bedroom ones are the chepest - the volatility in the last 2 is potentially due to the scarcity of 5 and 6 bedroom flats. 1, 2 and 3 bedroom flats are a much larger proportion of the market, so even 4 bedroom flats may be hard to find. The ‘floor’ of each distribution is at approximately the same position for each number of bedrooms.
  • There are very few outcodes with high numbers of ‘affordable’ properties (below £1400 per bedroom per month). The outcodes with the most affordable properties are NW1, N1, SE1, E1 and SW1V. The outcodes that do have many affordable properties are not typically outcodes with greater amounts of crime.
  • Crime numbers are highly negatively correlated with distance from LSE, implying that crime numbers around LSE are typically much higher. This makes linking property prices and crime numbers difficult to do effectively.
  • Our restricted regression gave the result that on average, each additional km from LSE reduces cost per bedroom per month by £56. However, the results from this regression are hard to be classified as causal, given the nature of the data and lack of accounting for ‘luxury’ of properties. It gives, however, important and relevant insight into the relationships between the different factors in determining prices.
  • Prices for LSE accommodation are very competitive on a per month basis given their locations, and even better value for students given their shorter (39 or 40 week) contracts, so moving into privately rented accommodation is likely to be much more expensive.

5.2. Interpretation of Results¶

  • For a financially constrained LSE student, the best course of action is likely to try to secure a place at one of LSE’s student halls as they are one of the cheapest options given the proximity to campus.
  • However, as places for second-year and above students are limited, their best course of action is likely to seek out flats with 3-4 bedrooms in areas with more affordable options, such as NW1, N1 and SE1, which are also relatively safe as we saw in 4.4. This is because crime numbers tend to be higher closer to LSE, where private flats are generally more expensive as well.
    • Therefore, the areas of Camden, Islington, and Lambeth & Southwark could be potential target areas for students.
  • Further insights were given above into the difference in price-setting fators between private flats and LSE accommodation - the main one being, proximity to LSE. The implication of this and the analysis above is that one can find a cheaper private property than LSE if it is further away from campus, but if it is as close as LSE accommodations, they are hard to beat.
    • As a result, a student with lower budget than LSE accommodation prices will have to aim at areas further from campus.

5.3. Limitations¶

  • Confounder - *luxury*:
    • A key limitation of the regression analysis, and many of the other price comparisons in the data analysis, is that we have no way of defining or designating the ‘luxury’ of a property - whether it is a fairly basic, small flat in a less desirable building or a lavishly appointed, desirable penthouse in a sought-after location. This data would help make comparisons more like-for-like, and be a helpful control in the regressions we performed. Obtaining the square-footage of a property would go some way to help this, but many properties do not have this information on the property page, and even if they did, this is still not a full picture of the ‘luxury’ of a building. To truly measure this would likely be a manual process of viewing each property and rating the quality of the apartment subjectively, which would not be a particularly time-efficient or scientific approach.
  • Incomplete data:
    • *Source*:
      • Data for this project was obtained only from OpenRent, the third largest housing website in London, as Zoopla and Rightmove both have rules against web scraping. This means we do not have complete data on London’s housing market, but rather a fairly large subset of it. Though we are fairly sure that OpenRent’s subset of the market is fairly representative, it would be optimal to have data from at least the three largest websites. More data could have been obtained by paying to gain access to these companies’ APIs, but this was beyond the budget and timeframe of this project.
    • *Scope*:
      • One further consideration that could be made is to obtain data for properties that are further away from LSE, but have convenient transport connections to the campus, for example are nearby stations on the Central or Piccadilly lines. This could present more affordable options for students without compromising on commute lengths. Whilst this data is available on property websites, we did not collect it as it would have involved scraping data for a much wider radius, which would have added a significant amount of time to the data scraping process.
  • Comparison between LSE accommodations and private flats:
    • When we gathered information for the costs of living in LSE accommodation, we only took the costs of renting a one-bedroom, non-ensuite room. Whilst this is by far the most popular arrangement for living in LSE halls, there are other options, both more expensive and cheaper, that could be used for comparison. Also, some LSE accommodation prices include catering, while all include utility bills.
    • There are also privately owned student halls in London that could be an option for students moving out of LSE halls, but gaining data on these would have been a manual effort rather than automated web scraping or API calls, and so beyond the scope of this project.
  • Timeframe of the data:
    • Another limitation for this project is that the data is a snapshot - we only collected data at one specific moment 24/04/2024, and analysed that ‘snapshot’. This means we do not have information on how prices change across the year, and therefore whether there is a time of year that may be cheapest to hunt for accommodation. This could also help give us a better view on the relationship between rental prices and crime statistics. This data could be gathered over a longer term study, or by collaborating directly with the housing websites.
    • Crime information was only gathered for the period of one month. Whilst we are fairly sure that crime information for a month is representative and comparable enough to be a strong measure of crime levels, as it is within the same city and so unlikely to be many differences due to external factors, it would be more thorough to include historical crime data too, for example the year prior or more. This could be added fairly simply using the functions we defined, but the API calls necessary to compile the data would take at least ten hours.

5.4. Potential Further Analysis¶

  • A more in depth version of this study would use data over time from all three major property sites, with some sort of parameter for ‘luxury’ included in the dataframe. Working directly with the websites (Zoopla, Rightmove and Openrent) would be essential for this, for instance by gaining access to their respective API and Automated Datafeed (ADF) systems. We would also include more types of LSE accommodation room and other privately owned student halls in the comparison. However, the use of these types of data would require much more in depth statistical analysis, and a more complex dataframe.
  • Finally, this project’s findings are accurate for LSE itself, and would be almost exactly representative for the neighbouring King’s College London. While it does not have particular relevance for housing markets in other areas of London, the same methodology, and the majority of the same code, could be used for similar projects for other areas with some minor tweaks. The OpenRent webscraping and the crime numbers API code has been automated to the extent that one just puts the postcode and area name of one's university or other insitution of interest, and the whole report will run for that location. Obviously, LSE halls webscraping code and the interpretation of results will not be the same, should the reader choose a different location.
  • We believe this report provides a strong insight into the London housing market and the dynamics between factors like Crime numbers, Distance from LSE, Number of bedrooms, and type of properties. It is a strong foundation for further analysis, which builds on our methodology, or adapts our approach to different areas. It could be relevant for anyone considering flat-hunting in London, or even the UK more broadly.

6. References¶

  • Motivation and data acqusition approach:
    • https://simplylondonrelocation.com/knowledge-base/2023-london-rental-price-review/
    • https://www.lse.ac.uk/student-life/accommodation/private-housing
    • https://www.homeviews.com/blog/10-best-rental-websites-for-london-property
    • https://www.openrent.co.uk/properties-to-rent/
    • https://www.police.uk/
    • https://www.lse.ac.uk/student-life/accommodation/search-accommodation?collection=lse-accommodation
  • Selenium and multithreading for webscraping:
    • https://selenium-python.readthedocs.io/
    • https://medium.com/@shashwat_ds/a-tiny-multi-threaded-job-queue-in-30-lines-of-python-a344c3f3f7f0
    • https://www.linkedin.com/pulse/multithreading-vs-multiprocessing-asyncio-code-examples-kaushik-yxgjc/
    • https://www.geeksforgeeks.org/multithreading-or-multiprocessing-with-python-and-selenium/
    • https://www.tutorialspoint.com/click-the-button-by-text-using-python-and-selenium#:~:text=To%20click%20a%20button%20using,user%20clicking%20on%20the%20button.
  • Extra syntax reference:
    • https://stackoverflow.com/questions/2465921/how-to-copy-a-dictionary-and-only-edit-the-copy
  • API data and postcodes gathering:
    • https://data.police.uk/docs/
    • https://postcodes.io/docs
  • Visualisation:
    • https://plotly.com/python/line-and-scatter/